ABSTRACT


Viewing worked examples before problem solving has been 
shown to improve learning efficiency in novice programming. 
Example-based feedback seeks to present smaller, adaptive 
worked example steps during problem solving. We present a 
method for automatically generating and selecting adaptive, 
example-based programming feedback using historical stu- 
dent data. Our data-driven feature-based (DDF) example 
generation method automatically learns program features 
from data and selects example pairs based on when students 
complete each feature. We performed an experiment to com- 
pare three example generation methods: Student trace data, 
Data-Driven Features (DDF), and Expert examples. Two 
experts rated the quality of feedback for each generator, 
and they rated both the Expert and DDF example feed- 
back as significantly more relevant to students’ goals than 
the Student example feedback. However, there were no sig- 
nificant differences between the DDF and Expert examples. 
We compared these approaches to one that combined DDF 
with an Interactive Selection step (DDF-IS), where the user 
(in this case, an expert) selects their preferred data-driven 
feature before an example is selected. DDF-IS produced sig- 
nificantly more relevant examples than all other approaches, 
with significantly higher overall example quality than DDF. 
This suggests that our DDF approach allows more relevant 
examples to be selected than existing approaches, and that 
we may be able to leverage interactivity with the student to 
further improve example quality. 


1. INTRODUCTION & BACKGROUND 


Prior studies show that worked examples are an effective in- 
structional support to help novices learn complex tasks [27, 
30, 28, 8]. Sweller argues that “..for novices, learning via 
worked examples should be superior to learning via problem 
solving.” [27]. In the domain of programming, researchers 
also suggest using worked examples to teach novices [8, 31].
Empirical studies have shown that interleaving worked ex- 
amples with similar practice problems is more effective than 
solving only equivalent programming problems by writing 
code, as students spent less time on training tasks and per- 
formed better on a posttest [30]. However, worked examples 
are traditionally only offered to students in between prob- 
lem solving attempts [30, 16, 7, 6], and they do little to as- 
sist students when they have difficulty during problem solv- 
ing. Based on the idea of worked examples, researchers have 
explored example-based feedback [9, 4, 13], which shows a 
correct piece of code to help students learn during problem 
solving, as a form of adaptive, on-demand support [9]. Sim- 
ilar to a high-level on-demand hint, example-based feedback 
demonstrates one step in a correct solution to the problem 
the student is working on, selected adaptively to match the 
student’s code. Keuning et al. argue for the need for such 
feedback in their review on automated feedback generation 
for programming, saying “the very low percentage of tools 
that give code examples based on the student’s actions is 
unfortunate, because studying examples has proven to be 
an effective way of learning” [15]. As Gross et al. argue, 
showing a relevant partial solution could impose less cogni- 
tive load and be easier to visually present to novices than 
showing a full solution [9]. Ichinco et al. found that novices 
have trouble transferring what they learned from similar 
problems to their own code [14]. Example-based feedback 
addresses this by providing example steps from the same 
problem the student is working on. Figure 1 presents a pro- 
totype of what example-based feedback might look like in a 
block-based novice programming environment. The exam- 
ple is presented with a “before” state, similar to the stu- 
dent’s current code, and an “after” state that completes a 
desired feature, helping the student easily identify the pur- 
pose and outcome of a single solution step. 


Figure 1: Example-based feedback prototype in iSnap.


Existing example-based feedback systems may rely on a library
of expert-authored examples [11, 14, 4], which can be
costly to maintain, as instructors are unlikely to create new
examples [12]. Additionally, the example code may show
an example solution to a related problem [11] that requires
students to transfer knowledge to solve their current problem,
which may be challenging for weaker students who need
more help [13]. To address these limitations, we present a
data-driven method to create example-based feedback using
historical student data. We focus on the domain of program- 
ming, where many problems have a vast space of possible 
solutions [24, 19], so a static set of examples is unlikely to 
be relevant to all students. To our knowledge, only Gross 
et al. have previously employed a data-driven method to 
derive examples adaptively to help students [9]. They found 
that examples derived from expert solutions are perceived 
by students as more helpful and help students make more 
solution improvements than those derived from student solu- 
tions. However, they simply presented the complete student 
solution which was most similar to the current student’s 
code (according to a distance metric), rather than trying to 
systematically identify and address students’ current pro- 
gramming goals. This suggests that existing evaluations of 
data-driven methods for example-based feedback have not 
explored the full potential of the approach. We hypothesize 
that with the innovations in this work, data-driven examples 
can be more adaptive than expert-authored ones, while still 
presenting correct and interpretable solution steps. 


In this work, we present an approach to automatically detect 
students’ progress towards a solution, and suggest example- 
based feedback from historical student data. Our data- 
driven method first processes prior correct student solutions 
to discover meaningful features, labels the parts of code that 
contribute to each feature, and removes code that does not 
contribute to the solution, to create simple data-driven code 
examples that contain only the code needed for each feature. 
We then automatically label the student’s current code with 
a “feature state” representing the presence or absence of each
data-driven feature needed for a correct solution. We then 
adaptively select example feedback that contains the same 
features as the current student code, and adds a new feature 
that is relevant to solving the current problem. In contrast, 
other example feedback systems show code from a related,
similar problem, requiring students to study the example
and transfer what they learn to the current context [11]. 


We evaluated two data-driven methods to generate and se- 
lect example-based feedback for historical student hint re- 
quests: Data-Driven Features (DDF), and Data-Driven Fea- 
tures with Interactive Selection (DDF-IS). We compared 
these methods against two baselines: (1) Expert-authored 
examples (Expert) and (2) examples generated naively from 
correct student solution traces, showing the code added between
consecutive test runs (Student). The data-driven features
(DDF) algorithm cleans prior students’ solutions and
generates example pairs where the “start code” has the same 
features as the student’s code, and the “end code” adds a 
new feature that is not yet present in the student’s code. 


The DDF with Interactive Selection (DDF-IS) approach ex- 
plored the potential to improve algorithmic DDF example 
feedback selection by having a user interactively select the 
data-driven feature with which they want help, before the 
algorithm selects an example. To simulate this experience, 
we used an expert to select this feature, representing a best- 
case scenario for interactive selection. 


We adapted a multidimensional data-driven hint evalua- 
tion rubric from previous work to evaluate the example- 
based feedback quality based on Relevance, Progress, In- 
terpretability, and Similarity. Our findings showed both the 
Expert and DDF feedback were significantly more relevant 
to students’ goals than the Student feedback, but there were 
no significant differences between the DDF and Expert ex- 
amples. This suggests that our DDF feedback can reason- 
ably replace Expert-authored examples in situations where 
they are unavailable or difficult to scale. We also found that 
DDF-IS produced significantly more relevant examples than 
all other approaches, with the highest overall example qual- 
ity, significantly higher than DDF. This suggests that in the 
best case, data-driven feedback may leverage interactivity 
with the student to further improve example quality. 


The contributions of this paper are: 1) a data-driven algo- 
rithm capable of generating adaptive example feedback for 
students during programming, and 2) an initial evaluation 
showing that these adaptive examples can be more relevant 
than static, expert-authored examples. 


2. METHOD 


This work presents and evaluates a data-driven feature-based 
(DDF) method for generating and selecting example-based 
feedback (explained in Section 2.2). To evaluate our DDF 
approach, we generated example-based feedback for histor- 
ical student help requests, and asked experts to evaluate 
the quality of each type of feedback, comparing against two 
baselines (explained in Section 2.3.1). 


2.1 Dataset 


Our dataset comes from iSnap [20], which extends the Snap! 
block-based programming environment with logging and on- 
demand, data-driven hint support. It logs student interac- 
tions with the system, including complete code snapshots 
after each edit. The data were collected during the Fall 2016 
(F16), Spring 2017 (S17), and Fall 2017 (F17) semesters in 
an introductory computing course for non-majors, held at 
a research university¹. In each semester, the students completed
3 in-lab assignments with access to help from teaching
assistants and 3 homework assignments independently. In 
this paper, we selected one homework assignment, Squiral,
for the example code generation and evaluation. In Squiral 
(shown in Figure 2), students program a “sprite” to draw a 
spiraling square-like shape using loops, variables, arithmetic 
operators and a custom block (function). Common solutions 
for Squiral contain 7-10 lines of code. The original dataset 
contains 57 (F16), 43 (S17), and 47 (F17) Squiral assignment 
submissions. Since iSnap offers students on-demand, data-driven
hints, which may alter students’ problem-solving patterns,
we excluded students who requested hints from the dataset
used for the example-based feedback generation. Our remaining
data contained 38, 29, and 39 code traces for the F16,
S17, and F17 semesters, respectively. Each code trace contains
the set of timestamped snapshots that comprise all of
a student’s work on a problem.


¹ All datasets are available at https://pslcdatashop.web.cmu.edu/Project?id=321


Figure 2: Squiral pseudocode, Snap! code and output [34].


2.2 Example-based Feedback Generation 

We propose an algorithm to automatically generate data- 
driven, example-based feedback with the following high-level 
steps: 1) Extract a set of data-driven features from prior stu- 
dent code traces, which each describe a property of a cor- 
rect solution. 2) Use the features to clean correct student 
code traces by removing extraneous code that does not con- 
tribute to a correct solution. 3) Generate pairs of example 
code snapshots that demonstrate how to complete a single 
feature. 4) Choose an appropriate, personalized example 
pair upon student request. 


Our goal in this work is to perform a preliminary evaluation 
of our algorithm before implementing a user interface and 
evaluating its impact in practice. However, to contextual- 
ize our work, Figure 1 presents one way that such feedback 
could be presented to a student in iSnap when they request 
help. Our DDF example pairs consist of “start code,” simi- 
lar to the student’s, and “end code” that demonstrates how 
to complete a single feature. Our design draws inspiration 
from a variety of theoretical and empirical sources, includ- 
ing cognitive load theory [29], Vygotsky’s Zone of Proximal 
Development [32], worked examples [27], learning from sub- 
goals [16], and compare/contrast tasks [17]. 


Research on worked examples suggests that seeing exam- 
ples which break a problem down into sequential steps (e.g. 
features) can be a more efficient way of learning than prob- 
lem solving [27, 30, 28]. According to cognitive load the- 
ory, worked examples are effective because they lower the 
extraneous cognitive load (mental effort) imposed by the in- 
structional materials. We designed our examples to present 
steps as a pair of “start” and “end” code, as previous studies 
have shown that comparing and contrasting examples is an 
effective learning activity in many domains [17, 26]. The ex- 
ample is adaptively selected to keep students in the Zone of 
Proximal Development [32] by starting with code similar to 
the student’s, which they can already understand, and scaf- 
folding the completion of a new feature, which they cannot 
yet accomplish on their own. The work of Morrison et al. 
[16] suggests that programming examples that are broken 
into subgoals can help improve learning for novices. When 
examples are isomorphic to the problem solving task, as in 
our case, they found that it is most effective for students to 
label subgoals themselves, a feature we could easily incor- 
porate into our example-based feedback. 


2.2.1 Step 1: Data-driven Feature Generation 

The goal of an example pair is to present how a meaning- 
ful self-contained portion of solution code, or feature, can 
be completed. An assignment may have students program 
multiple features, and the final correct solution should have 
all the correct features present. For example, in Squiral 
(shown in Figure 2), a feature could be to move the sprite 
in a square shape, draw some figure on the screen, or re- 
peat the spiral the correct number of times. In our previ- 
ous work [34], we manually defined expert-authored features 
(shown in Table 1) in a systematic way, and we also imple- 
mented a data-driven algorithm to automatically identify 
code features from student solutions. Our results showed 
that many of the data-driven features were easily inter- 
pretable and closely matched the expert-authored features. 
The two methods also had moderate agreement on whether 
a given student was in the same state or different states. 


The full procedure for data-driven feature extraction is given 
in [34], but we outline its high-level steps as follows: 


1) Preprocess student solutions: Some student solutions 
may contain extraneous code or procedures that were used 
for testing or resetting the environment, which we attempt 
to remove before extracting features. Specifically, we used 
the SourceCheck algorithm [21] to identify and remove whole 
scripts and procedures that do not match any element of an 
expert-authored solution in the correct student solutions. 


2) Generate code shapes: To identify common code patterns
in correct student solutions, we extract a set of code
shapes, or syntactic structures, from the solution code. We
convert students’ solution code into abstract syntax trees
(ASTs) and then identify all pq-Gram subtrees [1] in each
AST to form our initial set of “code shapes.” This includes
all code shapes from all correct solutions.


3) Remove duplicates: The initial set of code shapes may
include very similar shapes, including some AST patterns
that are subsets of others. We therefore remove these duplicates
by measuring the co-occurrence of code shapes in all
student code traces: if two code shapes almost always appear
together in the same code, we keep only the more specific
code shape and discard the other.


4) Identify decision shapes: Due to varied problem-solving
strategies, some code shapes may not appear in every
correct solution. For example, a Squiral solution can either
use nested repeat blocks or a single repeat block to rotate the
correct number of times (as shown in Figure 2), but not both.
Therefore, we define a decision shape as a disjunction of
code shapes, where almost all solutions contain exactly one
of the component code shapes. A decision shape is present
in a solution if any one of its code shapes exists in its code.


5) Filter out uncommon code and decision shapes: 
Since we are interested in using code and decision shapes to 
represent features of a correct solution, we keep only those 
shapes which appear in the vast majority of correct solu- 
tions, and filter out the rest. 


6) Form features: Our goal is to define a small set of features
that collectively represent a complete solution. However,
the previous steps will generate tens to hundreds of
code and decision shapes for a relatively simple problem
like Squiral. We therefore combine these smaller shapes into
larger features using a form of hierarchical clustering. We
iteratively combine any two features that most frequently
co-occur across student data until the size of the observed
state-space defined by the features starts to decrease rapidly.


7) Represent student code by feature vectors: Once
the features are formed, we can represent a student’s current
code as a vector indicating the presence or absence of each
feature. A student starts with a feature state of all 0s, and
a correct solution should have all features present, resulting
in a vector of all 1s.


Table 1: Expert Features and Corresponding Data-Driven Features, as derived in [34].

Feature Name                  | Brief Description                                     | Data-driven Analogue
E1. Procedure                 | Primary code inside of a procedure.                   | D1: Create a procedure OR a variable.
E2. Draw Anything             | Able to draw anything on screen.                      | D8: Use a ‘repeat’ AND create a variable OR a parameter.
E3. Move ‘Square-like’        | Able to move sprite in a square-like fashion.         | D11: Have a ‘move’ AND a ‘turn’ in a ‘repeat’ AND have a ‘pen down’.
E4. Correctly Use Parameter   | Correctly uses parameter within custom block.         | D4: Have a ‘repeat’ inside a procedure.
E5. Repeat Correct # of Times | Repeats square-like movement correct number of times. | D5: Have a ‘multiply’ block with a variable OR two nested ‘repeats’.
E6. Move ‘Variably’           | Movement is based on a variable not literal amount.   | D10: Have a ‘move’ with a variable argument inside of a ‘repeat’.
E7. Move ‘Squirally’          | Increase length to move for each side.                | D7: Change a variable inside a ‘repeat’.
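
The feature-state representation from steps 4 and 7 can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the published procedure: code shapes are stood in for by hypothetical string labels (the real shapes are pq-Gram subtrees of the AST), and a feature is modeled as a conjunction of decision shapes, loosely following the AND/OR phrasing of the data-driven features in Table 1.

```python
from typing import FrozenSet, List, Sequence, Set

# Hypothetical stand-in: a code shape is just a string label here.
Shape = str
# A decision shape is a disjunction: present if ANY component shape is present.
DecisionShape = FrozenSet[Shape]
# Assumption: a feature is complete when ALL of its decision shapes are present.
Feature = List[DecisionShape]


def decision_shape_present(decision: DecisionShape, code_shapes: Set[Shape]) -> bool:
    """Return True if at least one component shape occurs in the code."""
    return any(shape in code_shapes for shape in decision)


def feature_state(features: Sequence[Feature], code_shapes: Set[Shape]) -> List[int]:
    """Build the 0/1 vector marking which features the code completes."""
    return [
        int(all(decision_shape_present(d, code_shapes) for d in feature))
        for feature in features
    ]


# Illustrative features loosely modeled on D5 and D7; the labels are invented.
FEATURES: List[Feature] = [
    [frozenset({"multiply(var)", "repeat(repeat)"})],  # either strategy counts
    [frozenset({"repeat(changeVar)"})],
]

empty_state = feature_state(FEATURES, set())
partial_state = feature_state(FEATURES, {"repeat(repeat)"})
full_state = feature_state(FEATURES, {"multiply(var)", "repeat(changeVar)"})
```

In this sketch a student with no code is in state [0, 0], and either the nested-repeat or the multiply strategy is enough to complete the first feature.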


2.2.2 Step 2: Cleaning Student Code 

To extract good example pairs from prior student traces, we 
need to first remove excess blocks which do not contribute to 
a correct solution, as these may distract students and make 
examples harder to interpret. Identifying the excess blocks 
can be difficult, especially for intermediate partial solutions, 
since students can construct solutions in a large variety of 
ways [21, 24]. We address this by leveraging the features we 
defined in step 1 to exclude irrelevant code, which does not 
belong to any feature. Specifically, our cleaning procedure 
removes one node from the abstract syntax tree at a time 
(including all its children), then checks whether removing 
this node causes a currently completed feature to become 
incomplete. If removing a node “breaks” a feature, we as- 
sume that it is necessary and add it back; otherwise, we re- 
move it. We also check to make sure removing the node does 
not break any code dependencies, such as deleting a variable 
declaration when the variable is used elsewhere. We iterate 
over every node in a recursive, breadth-first manner, starting 
from the root node. Once this iteration stops, it produces a 
cleaned partial solution, where all irrelevant code has been 
removed. We apply this cleaning procedure to all snapshots 
in correct solution traces. With well-defined features, this 
process can effectively clean a large variety of both partial 
and complete solutions, ensuring that all remaining code is 
useful. We use this process both for cleaning code and for 
extracting example pairs, as described in the next step. 
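
The remove-and-check loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `FEATURES` table, and the label-based `completed_features` detector are hypothetical, and the dependency check (for example, on variable declarations) is omitted for brevity.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Node:
    """A minimal AST node: a block label plus child blocks."""
    label: str
    children: List["Node"] = field(default_factory=list)


def labels(root: Node) -> Set[str]:
    """Collect every label in the tree."""
    found = {root.label}
    for child in root.children:
        found |= labels(child)
    return found


def clean(root: Node, completed: Callable[[Node], Set[str]]) -> Node:
    """Remove, breadth-first, every subtree whose removal breaks no feature."""
    required = completed(root)
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        i = 0
        while i < len(parent.children):
            child = parent.children.pop(i)       # tentatively remove the subtree
            if required <= completed(root):
                continue                         # nothing broke: leave it out
            parent.children.insert(i, child)     # a feature broke: add it back
            queue.append(child)                  # later, try its own children
            i += 1
    return root


# Hypothetical detector: a feature is complete when all its labels occur.
FEATURES = {"draw": {"penDown"}, "square": {"repeat", "move", "turn"}}


def completed_features(root: Node) -> Set[str]:
    present = labels(root)
    return {name for name, needed in FEATURES.items() if needed <= present}


snapshot = Node("script", [
    Node("penDown"),
    Node("say"),                                  # extraneous block
    Node("repeat", [Node("move"), Node("wait"), Node("turn")]),
])
cleaned = clean(snapshot, completed_features)
```

Here the `say` and `wait` blocks are dropped because removing them leaves both features complete, while every other block is restored after its tentative removal.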


2.2.3 Step 3: Extract Example Pairs 

Our goal is to create a database of correct, meaningful, and
self-contained example pairs to offer as feedback to students.
Naively, we could extract a single example pair for each feature
from each correct code trace, since each student completed
each feature at least once. However, we want to generate
as many example pairs as possible, so that the algorithm
can adaptively select one that is similar to another
student’s code. We therefore developed a method to generate
many “synthetic” example pairs, each consisting of a pair
of code states $(c_0, c_1)_i$, from any cleaned student code trace.
Recall that each pair should cleanly demonstrate the completion
of exactly one feature by contrasting a “start code”
state ($c_0$) and an “end code” state ($c_1$). The algorithm first
extracts one example pair each time a student completes a
feature $f_i$, with $c_1$ defined as the snapshot right after $f_i$ was
completed, and $c_0$ as the snapshot right after the prior feature
$f_{i-1}$ was completed. We generate additional example
pairs from each cleaned snapshot in a student solution trace
with the following procedure. For each snapshot, the algorithm
labels it as an end state, $c_1$. It then removes exactly
one data-driven feature from $c_1$ to create a $c_0$, and together
these form the example pair. This feature removal is accomplished
using the code cleaning procedure described above;
however, instead of removing irrelevant nodes, we use it to
remove whole features. The algorithm first tries to remove
one leaf node, $l_i$, at a time. Since the snapshot has already
been cleaned, removing this node will either create an invalid
code state, or cause a feature to become incomplete.
In the latter case, the cleaning procedure is run to remove
all other code associated with the removed feature. The resulting
cleaned code becomes the $c_0$ for the example pair,
and our cleaning procedure guarantees that $c_0$ will have exactly
one fewer feature than $c_1$. The $(c_0, c_1)_i$ pair is added to a
list which stores our example pairs. The algorithm then repeats
this process recursively on $c_0$, which becomes the $c_1$ for
new example pairs, until no new pairs can be generated. In
this way, we generate many example pairs per snapshot in a
solution trace. While some are redundant, many are unique.
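
The recursive generation of synthetic pairs can be sketched at the level of feature sets. This is a simplification under explicit assumptions: the procedure described above operates on cleaned ASTs, whereas here a cleaned snapshot is abstracted to the set of features it completes, and the `depends_on` map is a hypothetical stand-in for removals that would produce an invalid code state.

```python
from typing import Dict, FrozenSet, Set, Tuple

State = FrozenSet[str]
Pair = Tuple[State, State]  # (features of start code, features of end code)


def synthetic_pairs(end: State, depends_on: Dict[str, Set[str]]) -> Set[Pair]:
    """Recursively remove one feature at a time from a cleaned snapshot."""
    pairs: Set[Pair] = set()

    def removable(feature: str, state: State) -> bool:
        # Assumption: a feature cannot be removed while another present
        # feature depends on it, since the code would become invalid.
        return not any(
            feature in depends_on.get(other, set())
            for other in state
            if other != feature
        )

    def expand(end_state: State) -> None:
        for feature in end_state:
            if not removable(feature, end_state):
                continue
            start_state = end_state - {feature}
            if (start_state, end_state) in pairs:
                continue  # redundant pair, reached by another removal order
            pairs.add((start_state, end_state))
            expand(start_state)  # the start state becomes a new end state

    expand(end)
    return pairs


# Hypothetical example: the 'repeat' feature lives inside the procedure.
final = frozenset({"procedure", "repeat", "move_var"})
pairs = synthetic_pairs(final, {"repeat": {"procedure"}})
```

Each generated pair differs by exactly one feature, mirroring the guarantee stated above, and duplicate pairs reached through different removal orders are stored only once.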


2.2.4 Step 4: Select an Example Code Pair 

When a student requests help, we aim to provide them with 
the most appropriate example pair in our database as feed- 
back. We define two ways that we can identify this example 
pair: 1) a Data-Driven Features (DDF) approach, using 
an algorithm to select the best example pair, or 2) a Data-Driven
Features with Interactive Selection (DDF-IS)
approach, giving the student the information needed to
select an example pair. We first consider the DDF approach,
in which we attempt to select the example pair which is most
similar to the current student’s code. Based on this selection
criterion, the selected example code pair should be very
similar to the student’s code, with the goal of minimizing
the effort needed to process the “start code” and allowing
the student to focus on the feature demonstrated by the ex- 
ample pair. In this study, for the DDF feedback, we selected 
an appropriate example code pair as follows: 


Generate proper example pair candidates: To filter
out inappropriate pair candidates based on the student’s
completed features, we select only those example pairs whose
“start code” has the same features as the current student
code, from the example pair lists (generated in Section 2.2.3).
If we cannot find any, we sort the example pairs by the Hamming
distance between the feature states of the “start code” and
the student’s current code, and select the example pairs
whose start state is closest to the student’s current code. If
multiple example pairs are available (all with the
same start state), we use the SourceCheck algorithm to sort
the example pairs based on the similarity between the “start
code” and the student’s current code.


Select one example pair from candidates: We iterate 
over the example pair candidates and select exactly one pair, 
which accomplishes a feature that the student has not fin- 
ished yet, preferring features that the majority of students 
in a state similar to the current student will take next. 
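
The two selection steps above can be sketched on feature-state vectors. This is a hedged illustration rather than the published implementation: the SourceCheck tie-break on code similarity is left out, and `next_features` is a hypothetical list of the feature indexes that prior students in a similar state completed next.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

State = Tuple[int, ...]
Pair = Tuple[State, State]  # feature states of (start code, end code)


def hamming(a: State, b: State) -> int:
    """Number of features on which two feature states disagree."""
    return sum(x != y for x, y in zip(a, b))


def added_feature(pair: Pair) -> int:
    """Index of the single feature that the end code adds."""
    start, end = pair
    return next(i for i, (s, e) in enumerate(zip(start, end)) if s != e)


def select_pair(student: State, pairs: Sequence[Pair],
                next_features: Sequence[int]) -> Optional[Pair]:
    """Pick an example pair for the student's current feature state."""
    if not pairs:
        return None
    # Candidates: start states identical to the student's state if any exist,
    # otherwise the start states at the smallest Hamming distance.
    best = min(hamming(start, student) for start, _ in pairs)
    candidates = [p for p in pairs if hamming(p[0], student) == best]
    # Keep only pairs that demonstrate a feature the student has not finished.
    useful = [p for p in candidates if student[added_feature(p)] == 0]
    if not useful:
        return None
    # Prefer the feature most prior students completed next from this state.
    popularity = Counter(next_features)
    return max(useful, key=lambda p: popularity[added_feature(p)])


PAIRS = [
    ((1, 0, 0), (1, 1, 0)),
    ((1, 0, 0), (1, 0, 1)),
    ((0, 0, 0), (1, 0, 0)),
]
exact_match = select_pair((1, 0, 0), PAIRS, next_features=[2, 2, 1])
nearest_match = select_pair((1, 1, 0), PAIRS, next_features=[2, 2, 1])
```

In the second call no start state matches the student's state exactly, so the sketch falls back to the nearest start states and skips the pair whose added feature the student already has.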


To generate interactive example-based feedback, we need a 
way to communicate to the student which features have an 
available example that they can request. By default, our fea- 
tures are unlabeled (having been generated automatically 
from data), but with a small amount of instructor effort 
(about 3 minutes), they can be labeled. By viewing the 
code shapes required by each feature, an instructor famil- 
iar with the problem can generate a short, human-readable 
description, such as “move the sprite using a variable”, or
“make the sprite move further each time”. These options
can then be shown to a student. Once the student has se- 
lected a feature for an example, we select an appropriate 
example pair that accomplishes that feature with start code 
matching the student’s current feature state. If there are 
multiple options, we use the same criteria as in DDF: select 
the example pair with the most similar start state to the stu- 
dent’s code, which contains a proper subset of the student’s 
features. Note that unlike in DDF, a student could request 
an example for a feature that they have already completed. 
Students might decide to do this when they are unsure if 
they have completed a feature correctly and want to see an 
example for confirmation. 
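
A minimal sketch of the interactive variant follows, assuming the student's menu choice arrives as a feature index. The `LABELS` entries reuse the two example descriptions given above; the indexes, the helper names, and the distance-based fallback for an already completed feature are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

State = Tuple[int, ...]
Pair = Tuple[State, State]  # feature states of (start code, end code)

# Instructor-written labels for otherwise unlabeled data-driven features.
LABELS: Dict[int, str] = {
    0: "move the sprite using a variable",
    1: "make the sprite move further each time",
}


def adds(pair: Pair, feature: int) -> bool:
    """True if the pair's end code completes the given feature."""
    start, end = pair
    return start[feature] == 0 and end[feature] == 1


def select_for_feature(student: State, feature: int,
                       pairs: Sequence[Pair]) -> Optional[Pair]:
    """Return the pair for the chosen feature with the closest start state."""
    matching = [p for p in pairs if adds(p, feature)]
    if not matching:
        return None

    def mismatch(pair: Pair) -> int:
        return sum(s != t for s, t in zip(pair[0], student))

    # Unlike the automatic selection, the requested feature may already be
    # complete in the student's code, so an exact start match may not exist.
    return min(matching, key=mismatch)


PAIRS = [((0, 0), (1, 0)), ((0, 1), (1, 1)), ((1, 0), (1, 1))]
menu = [LABELS[i] for i in sorted(LABELS)]
new_feature = select_for_feature((1, 0), 1, PAIRS)
confirmation = select_for_feature((1, 0), 0, PAIRS)
```

The second request asks for a feature the student has already completed, which corresponds to the confirmation use case described above.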


2.3 Expert Evaluation

To test the feasibility of our method before building the 
whole system and conducting a user study, we did a prelim- 
inary expert evaluation of our algorithm. We generated ex- 
ample pairs to support student code snapshots from our his- 
torical dataset and evaluated their quality. To simulate real 
student help requests where an example might be needed, we 
selected snapshots that corresponded to times when histori- 
cal students requested hints from iSnap. As in prior work on 
evaluating feedback [22], we sampled up to two hint requests 
(and their corresponding student code snapshots) from each 
student. We sampled 50 hint requests in total, including 20
in F16, 20 in S17, and 10 in F17. For each hint request
we generated four example code pairs using different techniques:
DDF, DDF-IS, and our two baselines, Student and
Expert (explained in the next section). For the examples derived
from student data (DDF, DDF-IS, Student), we generated
examples using semester-based 3-fold cross-validation.
For each semester, we used the other two semesters’ data as
training data to generate the examples. We derived 9, 9, and
10 data-driven features for F16, S17, and F17, respectively.
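
The semester-based split can be sketched as a leave-one-semester-out loop. The trace counts come from Section 2.1; the trace identifiers and the function name are hypothetical placeholders.

```python
from typing import Dict, Iterator, List, Tuple


def semester_folds(traces: Dict[str, List[str]]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (held-out semester, training traces from the other semesters)."""
    for held_out in traces:
        training = [
            trace
            for semester, items in traces.items()
            if semester != held_out
            for trace in items
        ]
        yield held_out, training


# Trace counts per semester as reported in Section 2.1; IDs are placeholders.
TRACE_COUNTS = {"F16": 38, "S17": 29, "F17": 39}
traces = {
    semester: [f"{semester}-trace-{i}" for i in range(count)]
    for semester, count in TRACE_COUNTS.items()
}
folds = dict(semester_folds(traces))
```

Examples for a hint request from one semester are then generated only from the training traces of the other two, so no student's own semester contributes to their feedback.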


Two co-authors, who neither authored examples nor worked 
on the algorithm itself, served as experts to evaluate the 
generated example pairs. Both experts have extensive ex- 
perience in Snap! and the Squiral assignment. We built an 
interface in iSnap to present each expert with the student’s 
original code and the example pair. Then we asked the ex- 
perts to assess the example pair based on a detailed example 
code rating rubric², adapted from [22]. Our rubric has 4 attributes,
each rated 1, 2, or 3, with higher scores being better.
These 4 attributes measured: 1) Relevance: how relevant
the suggested example code pair is to the student’s current
goals; 2) Progress: how well the example code pair helps
students make progress towards the final correct solution; 3)
Appropriateness & Interpretability: how likely a tutor
would be to suggest this example pair to a student and how
easily a novice could understand the intention of the suggested
example pair; and 4) Similarity: how similar the
“start code” is to the student’s code. The first 3 attributes are
meant to assess the quality of the example. The 4th is meant 
to help us understand the relationship between example sim- 
ilarity and quality, since all examples were selected based on 
their similarity to student code. During evaluation, experts 
had access to students’ code history, and based their ratings 
on the student’s individual context. 


To ensure that the two experts had a similar understanding
of the rubric, they rated 10 examples together, which were
not used in this study. They then rated the 200 example
pairs used in this study in 2 rounds of 100 each³. In Round
1, they independently rated 40 example pairs and then discussed
their ratings to resolve any conflicts and reach consensus.
Their inter-rater reliability across the 40 example
pairs achieved squared-weighted Cohen’s kappas of 0.94,
0.91, 0.82, and 0.90 for Relevance, Progress, Appropriateness &
Interpretability, and Similarity, respectively, indicating very
strong agreement. They then split the remaining 60 pairs
and rated them individually⁴. In Round 2, one expert rated
the remaining 100 pairs individually, and the other expert
rated 75 of these, which were discussed until consensus was
reached. Across the 115 example pairs that were rated by
both experts, the total squared-weighted Cohen’s kappa was
0.77.
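
For readers unfamiliar with the statistic, the following sketch computes a squared-weighted (quadratic-weighted) Cohen's kappa for two raters on an ordinal scale such as the 1 to 3 rubric scores. The ratings shown are invented for illustration and are not the study's data.

```python
from typing import List, Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int],
                             categories: Sequence[int]) -> float:
    """Cohen's kappa with squared disagreement weights for ordinal ratings."""
    n = len(a)
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}

    # Observed joint proportions of (rater A score, rater B score).
    observed: List[List[float]] = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    disagreement = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # 0 on the diagonal, 1 at extremes
            disagreement += weight * observed[i][j]
            expected += weight * row[i] * col[j]  # chance-level disagreement
    # Undefined (division by zero) if both raters use a single category.
    return 1.0 - disagreement / expected


# Invented ratings on the rubric's 1-3 scale.
identical = quadratic_weighted_kappa([1, 2, 3, 3], [1, 2, 3, 3], [1, 2, 3])
reversed_scores = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], [1, 2, 3])
```

Because the weights grow with the squared distance between scores, a 1 versus 3 disagreement is penalized four times as heavily as a 1 versus 2 disagreement.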


² Available at: http://go.ncsu.edu/edm2019-rubric

³ Due to the order in which examples were generated, Round
1 included DDF and Expert examples, and Round 2 contained
Student and DDF-IS examples. However, raters were
blind to the condition of the example.

⁴ The rated examples were in DDF and Expert conditions.


2.3.1 Baselines

We compare both the interactive and non-interactive data-driven
example-based feedback against two baselines: Expert-authored
examples and examples generated naively from
Student data. For data-driven example-based feedback, we
use the two strategies (DDF and DDF-IS) described in Section
2.2.4. Our goal for the Expert baseline was to reflect
a straightforward way of generating example-based feedback
using a small number of expert-authored example pairs, 
which an instructor might reasonably create. This corre- 
sponds to the Next Step of the Nearest Sample Solution 
(NSNSS) strategy introduced by Gross et al. [9], which 
they found to be optimal. This strategy selects the next 
step of the nearest expert solution. To generate expert ex- 
ample pairs, two experts (specifically, two co-authors who 
did not rate the example pairs during evaluation) manually 
authored example pairs for all steps in the most common 
solution paths, which at least 10% of students took to solve 
the problem. Each of the two co-authors created the example
code separately, based on the understanding that the
example code should be useful to students and that no extra
explanations would be used to help students understand the
suggested example code. During this process, the experts could
review as much student code (including all the history) as
they needed. Afterwards, they met and discussed all the
example pairs that they authored and came to consensus on 
the example pairs. In the Squiral assignment, the common 
solution graph has 14 nodes and 13 edges, of which 7 were 
on a primary solution path and 6 were on two alternative 
paths. To select an Expert example-pair for a student, we 
first used the SourceCheck algorithm to sort the Expert
example pairs based on the size of the “start code” and the
similarity between their “start code” and the student’s
current code. We then selected the example pair whose “start
code” accomplishes fewer features than the student’s code and
is very similar to the student’s code.
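
The selection step above can be sketched as a filter followed by a nearest-neighbor search. This is only an illustration under stated assumptions: SourceCheck is not reproduced here, and a token-level `difflib.SequenceMatcher` ratio stands in for its similarity function; the `ExamplePair` type, the feature labels, and the reading of "fewer features" as "no more features than the student has completed" are likewise hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class ExamplePair:
    start_code: str            # code before the example step
    end_code: str              # code after the example step
    start_features: frozenset  # features already completed in start_code


def similarity(code_a, code_b):
    """Stand-in similarity (NOT SourceCheck): token-level matching ratio."""
    return SequenceMatcher(None, code_a.split(), code_b.split()).ratio()


def select_expert_example(pairs, student_code, student_features):
    """Pick the most similar pair that does not start ahead of the student."""
    candidates = [
        p for p in pairs
        if len(p.start_features) <= len(student_features)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: similarity(p.start_code, student_code))


pairs = [
    ExamplePair("PenDown()", "PenDown() Repeat(4)", frozenset({"pen"})),
    ExamplePair("PenDown() Repeat(4) Move(10)",
                "PenDown() Repeat(4) Move(10) Turn(90)",
                frozenset({"pen", "repeat", "move"})),
]
chosen = select_expert_example(
    pairs, "PenDown() Repeat(4)", frozenset({"pen", "repeat"}))
```

In this toy input the second pair is textually closer to the student's code, but it starts with more features than the student has completed, so the filter excludes it and the first pair is chosen.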


It may seem unfair to compare the Interactive Data-driven 
examples to (non-interactive) Expert examples. However, 
we note that it would not be reasonable to create an Inter- 
active-Expert baseline. Since the Expert examples consist 
of a small number of hand-authored examples, which com- 
pleted features in a specific order (e.g. Feature 1 is always 
completed before Feature 2), it is not reasonable to give a 
student the option of selecting a desired example. If the stu- 
dent has only completed 2 features, they could simply select 
the example for the final feature and see a full solution. Since 
the goal of example-based feedback is to show only a single, 
incomplete feature, we always selected the closest Expert 
example-pair to the student’s current code. By contrast, our 
data-driven approach generates enough example-pairs that 
we are able to provide many choices of features to complete, 
without revealing any other features in the process. 


Our goal for the Student baseline was to reflect a naive ap- 
proach to extracting examples from student code, without 
the feature-based cleaning and selection of our own algo- 
rithm. Gross et al. [9] defined their baseline of student- 
derived examples to show only students’ submitted solu- 
tions, essentially giving away the whole answer. Instead, our 
baseline extracts multiple examples from students, based on 
when they ran their code. We hypothesized that students 
often run their code when they have completed a mean- 
ingful feature, making this a meaningful way to demarcate 
examples. We extracted examples from all correct student 
solution traces that did not request help. We selected stu- 
dent code snapshots based on when they ran their code. We 
treated consecutive run events within 15 seconds of one an- 
other as a single run event, and we took the last of these 
events as the boundary between example pairs. For each
consecutive pair of run events (more than 15 seconds apart),
we extracted an example pair consisting of the two
corresponding snapshots. The earlier code snapshot serves as
the “start code” of the example pair, and the later one
serves as the “end code”. When selecting an
example pair to show as feedback for a given student, we 
used the SourceCheck algorithm to find the nearest “start 
code” and present that example pair as feedback. 
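
The run-based segmentation of the Student baseline can be sketched as follows. The `RunEvent` type and function names are hypothetical, and the sketch assumes each run event carries a timestamp in seconds and the code snapshot at that moment; it illustrates the 15-second merging rule rather than reproducing the study's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunEvent:
    time: float    # seconds since the start of the attempt
    snapshot: str  # the student's code when they ran it


def collapse_runs(events, gap=15.0):
    """Treat runs within `gap` seconds of the previous run as one event,
    keeping the last run of each burst."""
    collapsed = []
    for event in sorted(events, key=lambda e: e.time):
        if collapsed and event.time - collapsed[-1].time <= gap:
            collapsed[-1] = event      # same burst: keep the later run
        else:
            collapsed.append(event)    # more than `gap` apart: new boundary
    return collapsed


def extract_example_pairs(events, gap=15.0):
    """Return (start code, end code) pairs between consecutive boundaries."""
    runs = collapse_runs(events, gap)
    return [(a.snapshot, b.snapshot) for a, b in zip(runs, runs[1:])]


trace = [RunEvent(0, "A"), RunEvent(5, "B"), RunEvent(12, "C"),
         RunEvent(60, "D"), RunEvent(70, "E"), RunEvent(200, "F")]
example_pairs = extract_example_pairs(trace)
```

Here the runs at 0, 5 and 12 seconds form one burst (represented by snapshot C), the runs at 60 and 70 seconds form another (E), and the run at 200 seconds stands alone (F), giving the pairs (C, E) and (E, F).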


3. RESULTS 


We structured our analysis around the following research
questions: RQ1: Can we create useful example-based feedback
naively from student data, without a data-driven algorithm?
RQ2: How does the quality of data-driven example-based
feedback compare with that of expert-authored example-based
feedback? RQ3: Can the quality of data-driven example-based
feedback be improved if an example is selected interactively,
rather than automatically?


We address each RQ by comparing the quality of example- 
based feedback generated by four feedback approaches ex- 
plained above: 1) Expert-authored (Expert), 2) Naive Stu- 
dent Data (Student), 3) Data-Driven Features (DDF), and 
4) Data-Driven Features with Interactive Selection (DDF- 
IS). We evaluated quality in terms of Relevance, Progress, 
and Interpretability, as explained above. For RQ1, we
hypothesized that, consistent with Gross et al. [9], naively
extracted student examples would not lead to high-quality
feedback. For RQ2, we hypothesized
that data-driven example-based feedback would be more rel- 
evant to the student’s code than expert feedback and just as 
useful otherwise. For RQ3, we hypothesized that interactive 
selection would improve the quality of data-driven example- 
based feedback. 


3.1 Feature Coverage 

From 106 student solution traces, the DDF algorithm was 
able to generate 13,927 unique data-driven examples, as 
shown in Table 2. Of these, only 728 (5%) were derived 
directly from a student’s trace, and the rest were generated 
with the recursive algorithm explained in Section 2.2.3. It 
took around 10 minutes (645 seconds) to generate all the 
data-driven example pairs. Table 2 shows the number of ex- 
amples generated by each algorithm. It also gives the “snap- 
shot coverage” and “hint request coverage” of each algo- 
rithm. The former refers to the percent of all observed snap- 
shots which had an available example in the same feature- 
state (meaning the example started with the same set of fea- 
tures completed). For this calculation, we used the expert- 
authored features defined in [34]. Hint request coverage 
considers only the 50 hint request snapshots we evaluated. 
These numbers are averaged across the 3 semesters. 
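
As a rough illustration of the coverage measure, the sketch below represents each feature-state as a set of completed feature names and counts the fraction of snapshots whose feature-state matches the start state of at least one available example. The function name and the feature labels are hypothetical; the actual feature detection is described in [34] and is not reproduced here.

```python
def feature_state_coverage(snapshot_states, example_start_states):
    """Fraction of snapshots whose feature-state has an available example."""
    if not snapshot_states:
        raise ValueError("no snapshots to evaluate")
    available = set(example_start_states)
    covered = sum(1 for state in snapshot_states if state in available)
    return covered / len(snapshot_states)


# Feature-states are frozensets so they can be compared and hashed.
snapshots = [frozenset(), frozenset({"pen"}),
             frozenset({"pen", "repeat"}), frozenset({"repeat"})]
example_starts = [frozenset(), frozenset({"pen"}), frozenset({"pen"})]
coverage = feature_state_coverage(snapshots, example_starts)
```

In this toy input two of the four snapshot states have a matching example, so the coverage is 0.5. Hint request coverage would apply the same function to only the snapshots at which hints were requested.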


Table 2: Total number of generated example pairs,
corresponding average snapshot coverage and hint coverage in
the expert-defined feature space.


                          | DDF & DDF-IS | Stud.
# of examples generated   | 13,927       |
Snapshot coverage         | 0.843        | 0.670
Hint request coverage     |              |

3.2 Expert Ratings

Our dataset consists of 50 hint requests and 200 example 
pairs (one example pair per feedback approach for each hint 
request). All the example pairs were rated on four attributes 
on a scale of 1-3: Relevance, Progress, Appropriateness & 
Interpretability, and Similarity, as described in Section 2.3. 
We found the first three attribute ratings showed significant 
positive pairwise Spearman correlations ranging from 0.51 to 
0.81 (all p < 0.001). Similarity also had a lower, positive cor- 
relation with the other attributes, ranging from 0.24 to 0.34 
(all p < 0.001). Due to the high positive correlation of the 
Relevance, Progress, and Appropriateness & Interpretability 
attributes, we also compute a Quality attribute, which sums 
all three attributes, with scores ranging from 3 to 9. Because 
all 4 example-based feedback approaches selected examples 
using SourceCheck’s code similarity function, we also inves- 
tigated the relationship between SourceCheck’s calculated 
similarity and the expert-rated Similarity. We found a
significant, positive correlation (ρ = 0.29; p < 0.001),
suggesting that the similarity function is reasonable but
could be improved. Table 3 reports mean values of each
attribute for each example-based feedback approach⁵.


Table 3: Mean attribute ratings (with standard deviation)
for example pairs from each approach.


          | Relevance | Progress | Approp. & Interp. | Quality
DDF-IS    | 2.62      | 2.46     | 2.22              | 7.30
Expert    | 2.24      | 2.36     | 2.14              | 6.74
DDF       | 2.12      | 2.06     | 1.84              | 6.02
Student   | 1.72      | 2.12     | 1.80              | 5.64


To address our research questions, for each attribute we used
a Kruskal-Wallis test to determine if there was a significant
difference in ratings across feedback generation approaches.
For the overall Quality attribute, we found a significant
difference among conditions (χ²(3) = 16.06, p < 0.001). We
performed a post hoc Dunn’s test with Benjamini-Hochberg
correction for multiple comparisons⁶ [2, 5] to identify
pairwise significant differences between approaches. This
showed a significant difference between Expert and Student
examples (z = 2.48, p = 0.026, r = 0.25), DDF and DDF-IS
(z = 2.76, p = 0.017, r = 0.28), and DDF-IS and Student
(z = 3.69, p = 0.0013, r = 0.37), suggesting that for overall
quality, Student < DDF < DDF-IS, and Student < Expert.
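
To make the correction step concrete, the sketch below computes Benjamini-Hochberg adjusted p-values for a family of pairwise comparisons, together with an effect size of the form r = |z| / sqrt(N). Both functions are illustrative: the p-values in the usage lines are made up, and the effect size formula is a common convention assumed here, since the text only states that r follows [25].

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def effect_size_r(z, n):
    """Effect size r = |z| / sqrt(n) (assumed convention, see lead-in)."""
    return abs(z) / math.sqrt(n)


raw = [0.01, 0.04, 0.03, 0.005]     # hypothetical unadjusted p-values
adjusted = benjamini_hochberg(raw)  # compare each against alpha = 0.05
r = effect_size_r(2.48, 100)        # assumes n = 100 ratings in the pair
```

Under the assumption that each pairwise comparison involves 100 ratings (50 hint requests for each of two approaches), z = 2.48 gives r of about 0.25, which matches the value reported for the Expert and Student comparison.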


We then inspected the difference for each individual
attribute. A Kruskal-Wallis test showed a significant
difference among approaches for Relevance (χ²(3) = 24.45,
p < 0.001). A post hoc Dunn’s test with Benjamini-Hochberg
correction showed a significant difference between

⁵ Although attribute ratings are ordinal, we report the mean
and SD, since the median values for each attribute are
generally the same (2 or 3).

⁶ We report p-values corrected with the Benjamini-Hochberg
procedure to control the false discovery rate at 0.05, keeping
the α significance threshold at 0.05.

⁷ The effect size r is calculated as described in [25].


DDF and Student (z = 2.14, p = 0.049, r = 0.21), Expert
and Student (z = 2.79, p = 0.016, r = 0.28), DDF-IS and
DDF (z = 2.76, p = 0.011, r = 0.28), DDF-IS and Student
(z = 4.90, p < 0.001, r = 0.48), and DDF-IS and Expert
(z = 2.11, p = 0.04, r = 0.21), but we did not find a
significant difference between DDF and Expert (z = 0.65,
p = 0.515, r = 0.07). This suggests that for Relevance,
Student < DDF = Expert < DDF-IS.


We also found a significant difference among conditions for
Appropriateness & Interpretability (χ²(3) = 8.98, p =
0.030). However, a post hoc Dunn’s test with
Benjamini-Hochberg correction did not find any pairwise
significant differences between conditions. For the other
two attributes, we did not find any significant difference
between conditions: Progress (χ²(3) = 7.14, p = 0.068);
Similarity (χ²(3) = 0.62, p = 0.89). The results
suggest that DDF-IS example pairs are more relevant to
student code than expert-authored examples and are equally
helpful and interpretable.


3.3 Inspection of Examples 

To better understand our quantitative results, we manually 
inspected examples generated by each approach to under- 
stand how they differed and how those differences impacted 
the expert ratings. We investigated the following questions 
and present situations that highlight possible answers: 


Why does the naive Student baseline have low qual- 
ity? Our quantitative results show that our naive Student 
baseline does not produce useful example pairs. We manu- 
ally investigated some Student example pairs to better un- 
derstand why. Recall that the Student baseline extracts 
examples showing the changes between consecutive runs of
student code. We derived 242 of these example pairs across
3 semesters from 39 correct student submissions that did not 
request hints. Upon inspection, it is clear that this segmen- 
tation approach did not always produce meaningful example 
pairs. For example, a Student example pair might simply 
rearrange the order of some code blocks. This is a common 
debugging behavior, but it does not usually yield a useful 
example. We also found that even when Student example 
pairs demonstrated a meaningful step, they were often se- 
lected for students who had already completed that feature. 
This is not surprising, since the Student baseline selected 
the example with the most similar start code, but this does 
not guarantee the student will not have additional features 
completed. Lastly, many Student examples contained
extraneous code that made them harder to interpret. These
three problems (lack of meaningful steps, repetition of
completed steps, and extraneous code) are all addressed in our
DDF-based example selection. With good features, we can
identify meaningful example steps, ensure that all examples 
complete new features, and remove extraneous code. 


When were data-driven examples more adaptive than 
Expert examples? Our quantitative results suggest that 
the DDF-IS approach was able to generate examples that 
were significantly more Relevant than those of the Expert 
approach. We hypothesized that this was enabled by the 
large number of unique example pairs generated by DDF 
(13,927). By contrast, the Expert approach used only 13
example pairs, which covered the common solution paths


Figure 3: DDF-IS and Expert examples for one snapshot
(red/green indicate changed code in the example). Panels, left
to right: Student Code, DDF-IS, Expert.


that at least 10% of students took to solve the problem. 
While some of DDF-IS’s improvement may have come from 
its Interactive Selection step (explored below), our results 
in Section 3.1 also show DDF examples have larger cover- 
age than the expert-authored examples for both the hint 
requests and student snapshots. Our manual inspection
uncovered a number of situations in which a relevant Expert
example was not available, but a DDF example was. For
example, a strong minority of students solved the Squiral
problem by using a second input (parameter) to store the
side length of the shape. Figure 3 shows that for the same
hint request, DDF-IS can suggest examples that are not only
similar to the student code, but also relevant to what the
student is working on. However, the expert-authored example
has only one parameter in the custom block and suggests
something that the student has already done. The Expert
example becomes less helpful and less relevant when a student
has a solution that deviates from the Expert examples. By
contrast, our data-driven examples are more adaptive in this
scenario and can provide examples both similar and relevant
to the student. We note that this adaptivity was enabled in
large part
by the recursive example generation algorithm outlined in 
Section 2.2.3. Only 2 of the 50 examples selected by DDF-IS 
were extracted directly from student code traces (the “naive 
approach”); the other 48 were generated synthetically. 


When did DDF-IS select better examples than DDF? 
Since DDF and DDF-IS were selecting examples from the 
same pool of examples, but DDF-IS has significantly higher 
overall quality, we wanted to understand where the selection 
algorithms differed. We investigated some pairs and focused 
on when our algorithm failed to select relevant examples. In 
those scenarios, we found that DDF would suggest that the
student add a ‘pen down’ block, or add a custom block in the
main script area. Those features are necessary for a correct
solution, but they may have lower priority compared with 
other features such as “control the sprite to draw a square” 
or “repeat the spiral the correct number of times.” Addi- 
tionally, we also found that the data-driven feature detector 
sometimes counted a feature as complete when it still had 
a bug or a missing block. Thus, it would suggest example 
pairs that accomplish new features without fixing the bro- 
ken one. In the opposite case, the student may have already 
finished a feature, but the feature detector sometimes failed 
to detect it, showing a redundant example. This occurred 
frequently when students accomplished a feature in a unique 
way, not captured by our data-driven feature definition. For 
example, a few students attempted to use a recursive ap- 
proach to solve Squiral, but since this was quite uncommon, 
our data-driven features failed to generate meaningful
example pairs for this solution. DDF-IS can help resolve
the first two cases, since students can choose whether they
need help on a feature that DDF failed to detect.


4. DISCUSSION 


RQ1: Can we create useful example-based feedback 
naively from student data, without a data-driven 
algorithm? Our results suggest that naively extracting 
examples from student code based on when students ran 
their code does not yield high-quality example-pairs. These 
naive Student examples were rated lowest or second-lowest
on every dimension, with significantly lower Relevance than
all other approaches and significantly lower overall Quality
than the DDF-IS approach and the Expert-authored examples.
This is consistent with our hypothesis for RQ1. It
is also in agreement with the work of Gross et al. [9],
who found that student solutions used as example-based
feedback were rated lower by experts and led to less
improvement in students than other, expert-authored
examples. However, rather than using a student’s submitted code
as an example as Gross et al. did, we used the changes that 
students made between running their code. We believe that 
this represents a more reasonable student baseline, since it 
only shows a part of the answer like the other feedback ap- 
proaches. It should also theoretically limit the cognitive load 
needed for students to learn from the examples. Our results 
show that even so, a naive approach does not create high- 
quality student examples, compared to other approaches. 
Our data-driven approach attempts to address this by cre- 
ating examples based on when students completed features. 


One of the goals of using student data to generate examples 
is that they can more closely match the student’s current 
code. In fact, we were able to extract 242 unique Student 
examples, compared with 13 for the Experts. This did en- 
able our system to identify Student examples that were more 
Similar to the help-requesting student’s code than Expert 
examples (though not significantly more). However, it is 
also clear that Similarity alone did not translate into Rele- 
vance, since the Student examples had the lowest Relevance 
scores. We cannot guarantee that students make meaning- 
ful changes between two consecutive snapshots, and thus the 
generated example pairs may not accomplish a meaningful 
chunk of code. For example, the Student example pair may 
suggest something that a student has already done. In this 
case, we could use the data-driven features to identify a later 
snapshot with more relevant changes as the “end code.” This 
suggests the need for a more deliberate, data-driven process 
that can still create a large number of examples to select 
from without sacrificing quality. 


RQ2: How does the quality of data-driven example- 
based feedback compare with that of expert-authored 
example-based feedback? We hypothesized that our 
Data-Driven Features (DDF) examples would be more
relevant to the help-requesting student’s code than the expert
baseline, with similar levels of Progress and Interpretability.
However, our results did not support this hypothesis. There were
no significant differences between the two approaches, with 
the DDF algorithm performing similarly on Relevance and 
slightly worse on Progress and Interpretability. Our original 
hypothesis was based on the premise that data-driven exam- 
ples would be more Relevant, since they are selected from 
a much larger database of examples (4,000 to 5,000) that 
represents a large variety of student code. While this was 
not confirmed, our results do suggest that our current,
non-interactive data-driven algorithm produces significantly
more Relevant examples than the naive Student baseline and
performs only marginally worse than Expert examples. Since it
is often difficult to get instructors to author examples [12],
this suggests our DDF examples should be a reasonable
substitute.


There are a few possible explanations for the poorer
performance of DDF. First, it is possible that the Squiral
problem we analyzed may not have been complex enough to
necessitate generating a large variety of examples, which is one
of the key proposed advantages of our algorithm. If most 
hint-requesting students performed the steps of the problem 
in the same order as the Expert examples, these Expert ex- 
amples would generally be Relevant, leaving less room for 
the DDF examples to improve upon them. We have some 
support for this explanation, since across semesters 53.8% 
to 85.7% of the hint requests we analyzed had an expert 
feature state that matched one of our Expert examples ex- 
actly. However, we also note that the expert baseline was 
rated less than 3 for Relevance 46% of the time, so there 
was clearly room for improvement, as we discuss later. 


It is also clear that the DDF approach did not perform as 
well as expected, scoring between the Expert and Student 
baselines for Progress and Interpretability. There are two 
possible explanations for this: either the algorithm is failing 
to generate good example pairs, or it is generating useful 
example pairs, but failing to select the best pair to show 
a given student. The higher scores of the DDF-IS algo- 
rithm, which generated the same example pairs as DDF, 
but allowed a human to select the best one, suggest that 
the problem lies in the DDF algorithm’s selection of exam- 
ple pairs. This selection process (explained in Section 2.2.4) 
primarily uses the SourceCheck algorithm to identify the 
example with the most similar start state to the student’s 
code. While this approach did lead to DDF having the
highest Similarity ratings, there was only a ρ = 0.238
correlation between Relevance and Similarity. This suggests
that Similarity alone is not a good proxy for example
Relevance or
overall quality. The challenge of automatically selecting an 
appropriate example also reflects prior work on data-driven 
feedback [23], where low-quality hints arose because the hint 
generation algorithm was unable to identify the most useful 
hints and filter out less useful hints. 


RQ3: Can the quality of data-driven example-based 
feedback be improved if an example is selected inter- 
actively, rather than automatically? Our results show 
that the quality of data-driven example-based feedback can 
be improved if the examples are selected interactively. The 
DDF-IS algorithm scored significantly higher than the DDF 
algorithm on Quality and Relevance, even significantly out- 
performing the expert baseline on Relevance, with the best 
overall scores. Recall that for the DDF-IS examples, we 
manually tagged the data-driven features with natural lan- 
guage labels, which a student could reasonably select from 
when making an example request. In this preliminary eval- 
uation, we simulated this selection process by manually se- 
lecting the best data-driven feature to show to each student. 
Importantly, we only selected the feature (e.g. “move the 
sprite using a variable”), not the example, and the algorithm 
still selected the best possible example for a given feature
(from hundreds of possible choices). Still, an expert
familiar with the system is probably more likely to choose an
appropriate feature than a student, so this represents an
upper bound for how well an interactive, data-driven example
selection algorithm could reasonably perform. These results
show that, when this proper selection is applied, the larger
database of example pairs generated by our algorithm can
lead to more Relevant examples than a static set of
expert-authored examples. This reflects a growing trend among
data-driven systems that support programming to leverage
human (i.e. student and instructor) knowledge in conjunction
with data-driven algorithms [10, 18, 33]. However, from
our preliminary expert evaluation, we cannot make claims
about how students would actually react to such a choice.
Our results also suggest ways that we can improve the
automated example selection procedure. The majority of
low-quality DDF example pairs demonstrated a feature that the
student needed but that was less relevant than other features.
For example, one example pair suggests adding the student’s
custom block into the main script so they can test it.
However, in Snap!, students can click on the custom block to
test it directly. This feature should therefore have lower 
priority than others. For the example shown in the Results 
section, the data-driven feature detector failed to detect cor- 
rect variations in student code, and suggested a feature that 
the student had already completed. These examples suggest 
that it may be more important to select the most important 
feature to demonstrate, before considering the similarity of 
the example to the student’s current code. 


5. CONCLUSION 


This study presents a method to generate data-driven
examples automatically from historical student data. We
evaluated two versions of data-driven example generation on
students’ code at the moments they requested help, and
compared them with two baselines: examples authored by experts
and examples derived naively from student solution traces.
Experts evaluated all of the example pairs based on a
multidimensional rubric. Our preliminary results demonstrate
that by selecting appropriate features, our data-driven
examples can be more relevant than both baselines while
retaining usefulness and interpretability. These promising
results suggest that our method can generate adaptive,
data-driven examples automatically with quality similar to
that of expert-authored examples. Even though our method
is based on a block-based programming environment, it is
language-agnostic and may also be generalized to textual
programming languages, though further studies are needed
to verify this proposed generalizability.


Limitations: This preliminary study relied on expert
ratings to assess example quality, yielding valuable insight,
but future work is needed to determine whether these results
translate into improved student performance and learning.
For example, even a well-rated example pair may still give
away too much of the answer and impede learning. We
also do not know how our results will generalize beyond
the single assignment evaluated here, since we only applied
this method to a short, block-based problem with a solution
that is usually 7-10 lines of code. In our DDF-IS condition,
we had an expert interactively select the feature to
be presented, rather than a student. As noted earlier, this
is likely an optimistic implementation, as we do not know
whether students can effectively select examples. Finally, all
conditions used the SourceCheck algorithm to select
examples similar to the student’s code, but other approaches
may better capture example relevance (e.g. [11]). We are
currently planning a student evaluation of the DDF algorithm
to address these limitations.